The Case Against Tests of Statistical Significance

Morris Lai 

Far West Laboratory for Educational Research and Development

The purpose of this paper is to (1) describe some of the serious
shortcomings in the current use of tests of statistical significance,
(2) discuss how misuses are perpetuated in some widely used references,
and (3) present an alternative significance testing model that overcomes
some, but not all, of the shortcomings of the currently used method.

Defining "testing statistical significance" 

For the purposes of this paper, the discussion will be restricted to
fixed effects analysis of variance (ANOVA) (including t-tests), which is
perhaps the most pervasive of the data analyses used by educational
researchers. A test of statistical significance is basically a process
whereby two or more groups are compared, and for whatever difference is
found, a "p value" is calculated, which is the probability that a
difference that large or larger would have arisen in a sample had the
groups been truly equivalent as populations.




[Figure: Distribution of the test statistic when groups are equivalent in
the population. Horizontal axis: test statistic, with the observed value
of the test statistic marked.]

For observed test statistics that are sufficiently large, the p values 
are correspondingly small (i.e., statistically significant). 
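
The definition of a p value given above can be made concrete with a small
sketch. It swaps in an exact permutation test for the F or t test discussed
in this paper, because the permutation version can be computed by direct
counting: every way of splitting the pooled scores into two groups is
treated as equally likely when the groups are truly equivalent. The
function name and the scores are hypothetical illustrations.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Share of all regroupings whose absolute mean difference is at
    least as large as the one actually observed (two-sided p value)."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        # Mean of the first group minus mean of the remaining scores.
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    as_extreme = 0
    n_splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        split_sum = sum(pooled[i] for i in indices)
        # Small tolerance guards against floating point ties.
        if abs_mean_diff(split_sum) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical scores for two groups of three students each.
print(permutation_p_value([1, 2, 3], [7, 8, 9]))
```

With only 20 possible regroupings of six scores, the smallest attainable
two-sided p value here is 2/20, which already hints at how strongly the
result depends on sample size.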

Random assignment 

Such a model requires, to start off with, random sampling. If assignment
to treatment is not random, then a test of significance is inappropriate
(Morrison & Henkel, 1969).

Type I error rate

Nearly every textbook on inferential statistics discusses the concept
of Type I and Type II errors. Despite warnings from Horst (1966), Skipper
et al. (1967), and Winer (1971) about the inappropriateness of endowing
Type I error rates of .05 and .01 with some sort of sacredness, the
prevalence of such sacredness is well known (e.g., the APA Publication
Manual advocates one asterisk for p < .05 and two asterisks for p < .01).

Practical or educational significance 

It is popular today to exhibit some enlightenment by emphasizing that
statistical significance does not necessarily imply practical or
educational significance. Yet in Guilford's (1956) widely used textbook we
find the following quote (p. 275):

  The F ratio for machines is significant beyond the .01 level,
  leaving us with considerable confidence that the machine
  differences, as such, have a real bearing upon the difficulty
  of the task.

Such a significant F could have resulted where the differences were
trivial in the practical sense. Another misuse of p levels occurs when
researchers use significance levels to compare results from several
studies (e.g., Eysenck, 1960; Bracht, 1970).

Type II error rates, power, and accepting null hypotheses

Type II error rates and power calculations are less familiar to
researchers. Anyone who accepts a null hypothesis without knowing the
power of the statistical test is liable to have a huge Type II error rate.
Yet Popham (1967) in his text writes "...hypothesis under consideration is
either accepted or rejected." Glass and Stanley (1970) also mislead their
readers by advocating, without consideration of power, the acceptance of
the null hypothesis when statistical significance is not attained. Other
writers who advocate (inappropriately) the accepting of null hypotheses
if a significant statistic is not observed include Walker and Lev (1953),
Guilford (1956), and Kirk (1968).
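
To illustrate how large a Type II error rate can be, the following is a
rough sketch of a power calculation for a two-group comparison. It assumes
a normal approximation with a known common standard deviation, equal group
sizes, and a two-sided test; the function name, the standardized difference
of 0.5, and the group sizes are hypothetical choices, not values from any
of the texts cited.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(std_diff, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z test.

    std_diff is the true mean difference divided by the common
    standard deviation (assumed known in this sketch).
    """
    normal = NormalDist()
    critical = normal.inv_cdf(1.0 - alpha / 2.0)
    shift = std_diff * sqrt(n_per_group / 2.0)
    # Probability that the test statistic lands in either rejection tail.
    return normal.cdf(shift - critical) + normal.cdf(-shift - critical)


for n in (10, 30, 100):
    power = approx_power(0.5, n)
    print(n, round(power, 3), "Type II error rate:", round(1.0 - power, 3))
```

Under these assumptions, a researcher with 10 subjects per group who
"accepts" the null hypothesis would miss a true difference of half a
standard deviation roughly four times out of five.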

It is possible to prove algebraically that for a predetermined level
of significance, there exist normal distributions such that the F or t
statistic will not be significant, but the size of the effects will be
larger than any predetermined number. As such, a researcher who accepts
a null hypothesis without knowing the power of the test may be calling a
very large difference a "zero difference." McNemar's (1962) suggestion of
using three regions (acceptance, suspended judgment, and rejection),
depending on the p level, does not overcome this objection.
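
The algebraic point can be sketched numerically for the two-group t
statistic with equal group sizes, t = D / (s * sqrt(2/n)), where D is the
mean difference and s the pooled standard deviation. For any D, however
large, a sufficiently large s keeps t below the critical value. The
function names and numbers below are hypothetical; 2.306 is the two-sided
.05 critical value of t with 8 degrees of freedom (two groups of 5).

```python
from math import sqrt

T_CRITICAL_DF8 = 2.306  # two-sided .05 critical value, 8 degrees of freedom


def pooled_t(mean_diff, pooled_sd, n_per_group):
    """Two-group t statistic for equal group sizes."""
    return mean_diff / (pooled_sd * sqrt(2.0 / n_per_group))


def sd_that_hides(mean_diff, t_critical, n_per_group):
    """Any pooled standard deviation above this value leaves the
    given mean difference statistically nonsignificant."""
    return mean_diff / (t_critical * sqrt(2.0 / n_per_group))


for diff in (10.0, 100.0, 1000.0):
    threshold = sd_that_hides(diff, T_CRITICAL_DF8, 5)
    t_value = pooled_t(diff, 1.05 * threshold, 5)
    print(diff, round(threshold, 2), round(t_value, 3))
```

Whatever difference is chosen in advance, the resulting t falls short of
the critical value once the spread is just 5 percent above the threshold.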



Sample size 

Another problem that I will discuss is determining sample size. Any
scientist appreciates the fact that the larger the sample, the more
information one has. Aside from cost-benefit considerations and
manageability, it is illogical to say that a smaller sample is more
desirable than a larger one; for example, Hays (1963) clearly states that
for precision, the bigger the sample size the better. Yet on the next page
(p. 3^4) he suggests that the researcher ask the following question: "Is
the sample size large enough to give confidence that the big associations
will indeed show up, while being small enough so that trivial associations
will be excluded from significance?" If a procedure is such that it
results in worry about whether a sample size is small enough, then surely
something is seriously wrong with that procedure.
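
The source of the worry about "too large" a sample can be seen in a short
sketch. It holds a trivial true difference fixed (0.02 standard deviations,
a hypothetical value) and assumes the observed difference equals the true
one, then computes the two-sided p value of a zero difference test under a
normal approximation with known standard deviation and equal group sizes.

```python
from math import sqrt
from statistics import NormalDist


def zero_null_p_value(std_diff, n_per_group):
    """Two-sided p value for a zero difference null hypothesis when the
    observed standardized difference is std_diff (normal approximation)."""
    z = std_diff * sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same trivial difference, tested with ever larger groups.
for n in (100, 1000, 10000, 100000):
    print(n, zero_null_p_value(0.02, n))
```

Under these assumptions the p value shrinks toward zero as the groups
grow, even though the difference itself never changes.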

Appropriate null hypotheses

The last problem I will discuss deals with null hypotheses. The
unquestioning acceptance of always testing a zero difference null
hypothesis has been criticized by several writers (e.g., Grant, 1962;
Kerlinger, 191^; Cohen, 1969). Dixon and Massey (1969) and Pena (1970)
have both presented a procedure for testing non-zero null hypotheses for
the two sample case. The incorporation of a predetermined minimum
practical difference into the null hypothesis (now non-zero) ties together
statistical and practical significance. By means of this rarely used
procedure, a researcher can state more appropriate null hypotheses.
Instead of asking if there is a difference at all, researchers usually
should be asking whether or not there is an educational or practical
difference. Instead of asking whether a Datsun gets better mileage than a
Cadillac, we should be asking how many more miles per gallon a Datsun gets
and whether this difference is of practical importance. Likewise, instead
of asking whether one group has scored higher than another, we should be
asking how much higher one group has scored than another and if this
difference is of practical or educational importance.
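
A minimal sketch of a non-zero null hypothesis test for two groups follows.
It is a simplified normal-theory stand-in (known common standard deviation,
equal group sizes, one-sided test), not the exact procedure given by Dixon
and Massey (1969) or Pena (1970). The function name, the scores, and the
minimum practical difference of 5 points are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def nonzero_null_z_test(mean_a, mean_b, sigma, n_per_group,
                        min_diff, alpha=0.05):
    """One-sided z test of H0: mean_a - mean_b <= min_diff.

    Setting min_diff to 0 gives the usual zero difference test.
    Returns (z, p_value, reject_null).
    """
    standard_error = sigma * sqrt(2.0 / n_per_group)
    z = ((mean_a - mean_b) - min_diff) / standard_error
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha


# A 2-point observed difference with 10,000 subjects per group.
print(nonzero_null_z_test(52.0, 50.0, 10.0, 10000, min_diff=0.0))
print(nonzero_null_z_test(52.0, 50.0, 10.0, 10000, min_diff=5.0))
```

With a zero difference null hypothesis the 2-point difference is rejected
easily; with the 5-point minimum practical difference built into the null
hypothesis, the same data give no grounds for rejection.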

Summary

In summary, well respected writers have suggested that researchers do
the following: (1) test null hypotheses that are usually inappropriate,
(2) accept these null hypotheses without regard to power (and possibly
have huge Type II errors), (3) use arbitrary (sacred) rejection
probability levels of .05 and .01, and (4) be careful not to get too large
a sample size.

These misleading (inappropriate) recommendations are interrelated in
that their disappearance would be highly correlated with the elimination
of tests of significance. But change comes slowly, and I propose an
analysis of variance methodology that gets rid of (1) and (4)
(inappropriate null hypotheses and the illogical concept of a sample being
too large).

Noncentral analysis of variance 

The method can perhaps be best understood in terms of its being an
extension of the two sample case, which has been described by Dixon and
Massey (1969). The analog to the minimum practical difference is $\delta$,
the noncentrality parameter of the noncentral F distribution. Just as the
ordinary F distribution is associated with a zero difference null
hypothesis, the noncentral F distribution is associated with a non-zero
null hypothesis. Minimum practical differences are now stated in terms of
average differences between groups.



The derivation of the noncentral ANOVA model is complex and will be
presented in more detail in another paper. The use, however, is rather
simple. Having determined the minimum practical difference, a researcher
need only use a table to determine the noncentrality parameter. He then
rejects the (nonzero) null hypothesis if his observed F statistic exceeds
$F_{\nu_1, \nu_2, \delta}(1-\alpha)$, where $\nu_1$ and $\nu_2$ are the
usual parameters that determine the central F distribution, $\delta$ is
the noncentrality parameter, and $\alpha$ is the Type I error rate chosen.
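
In place of a table, the critical value can be approximated by simulation.
The sketch below is illustrative only. It assumes the conventional
noncentrality parameter of the noncentral F distribution (the sum of
squared standardized mean shifts), which may be scaled differently from
the parameter tabled for this method. The function name, degrees of
freedom, noncentrality of 5, and observed F of 6.0 are hypothetical.

```python
import random


def simulate_f_quantile(df1, df2, noncentrality, q,
                        n_draws=20000, seed=0):
    """Monte Carlo estimate of the q-th quantile of a (noncentral) F
    distribution; noncentrality = 0 gives the ordinary central F."""
    rng = random.Random(seed)
    shift = noncentrality ** 0.5
    draws = []
    for _ in range(n_draws):
        # Numerator: noncentral chi-square, all of the shift placed
        # on the first of the df1 squared normal deviates.
        numerator = (rng.gauss(0.0, 1.0) + shift) ** 2
        numerator += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df1 - 1))
        # Denominator: ordinary central chi-square with df2 deviates.
        denominator = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df2))
        draws.append((numerator / df1) / (denominator / df2))
    draws.sort()
    return draws[min(int(q * n_draws), n_draws - 1)]


central_cut = simulate_f_quantile(2, 27, 0.0, 0.95)
noncentral_cut = simulate_f_quantile(2, 27, 5.0, 0.95)
observed_f = 6.0  # hypothetical observed F statistic
print("reject zero null:", observed_f > central_cut)
print("reject nonzero null:", observed_f > noncentral_cut)
```

The nonzero null hypothesis demands a larger observed F than the zero
difference null hypothesis does, which is how the minimum practical
difference enters the decision.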

Such a procedure results in an appropriate adjustment for sample size.
Thus, statistical significance is not attainable by merely increasing the
sample size. The illogical concept of too large a sample no longer exists.
At the same time, appropriate null hypotheses are being tested.